In recent years, vision-centric perception has flourished in various autonomous driving tasks, including 3D detection, semantic map construction, motion forecasting, and depth estimation. Nevertheless, the latency of vision-centric approaches is too high for practical deployment (e.g., most camera-based 3D detectors have a runtime greater than 300ms). To bridge the gap between ideal research and real-world applications, it is necessary to quantify the trade-off between performance and efficiency. Autonomous-driving perception benchmarks have traditionally evaluated models offline, neglecting the inference time delay. To mitigate the problem, we propose the Autonomous-driving StreAming Perception (ASAP) benchmark, which is the first benchmark to evaluate the online performance of vision-centric perception in autonomous driving. Building on the 2Hz annotated nuScenes dataset, we first propose an annotation-extending pipeline to generate high-frame-rate labels for the 12Hz raw images. To reflect practical deployment, we further construct the Streaming Perception Under constRained-computation (SPUR) evaluation protocol, in which the 12Hz inputs are used for streaming evaluation under different computational resource constraints. In the ASAP benchmark, comprehensive experimental results reveal that the model ranking changes under different constraints, suggesting that model latency and computation budget should be treated as design choices when optimizing for practical deployment. To facilitate further research, we establish baselines for camera-based streaming 3D detection, which consistently enhance the streaming performance across various hardware. ASAP project page: https://github.com/JeffWang987/ASAP.
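To make the effect of latency on streaming evaluation concrete, the following is a minimal sketch, not the SPUR protocol itself. It assumes a single model that processes one frame at a time, always starts on the newest frame when it becomes idle, and has a fixed latency; the function name `streaming_matches` and these scheduling rules are illustrative assumptions.

```python
from typing import List, Optional


def streaming_matches(num_frames: int, fps: float, latency_s: float) -> List[Optional[int]]:
    """For each frame timestamp, return the index of the input frame whose
    prediction is the newest one already finished (None if nothing is ready)."""
    period = 1.0 / fps
    arrival = [i * period for i in range(num_frames)]

    finished = []      # (finish_time, source_frame_index), in increasing time order
    busy_until = 0.0   # time at which the (hypothetical) model becomes idle again
    for i, t in enumerate(arrival):
        if t >= busy_until:
            # Model is idle: start inference on this frame.
            busy_until = t + latency_s
            finished.append((busy_until, i))
        # Otherwise the frame is skipped because the model is still busy.

    matches: List[Optional[int]] = []
    for t in arrival:
        ready = [src for done, src in finished if done <= t]
        matches.append(ready[-1] if ready else None)
    return matches


if __name__ == "__main__":
    # 12 Hz input with a 300 ms model: ground truth at each timestamp is
    # compared against a prediction computed from an older frame.
    print(streaming_matches(num_frames=9, fps=12.0, latency_s=0.3))
```

Under these assumptions a 300 ms detector on 12Hz input is always evaluated against predictions that are several frames stale, which is why offline accuracy alone can misrank models.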
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, has the advantage over cascaded approaches of achieving fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and subsequently predicts discrete acoustic units. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with a 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrograms in the second pass. However, predicting discrete units achieves a 2.51x decoding speed-up compared to that case.
Recently, webly supervised learning (WSL) has been studied to leverage numerous and accessible data from the Internet. Most existing methods focus on learning noise-robust models from web images while neglecting the performance drop caused by the differences between the web domain and the real-world domain. However, only by tackling this performance gap can we fully exploit the practical value of web datasets. To this end, we propose a Few-shot guided Prototypical (FoPro) representation learning method, which needs only a few labeled examples from reality and can significantly improve the performance in the real-world domain. Specifically, we initialize each class center with few-shot real-world data as the "realistic" prototype. Then, the intra-class distance between web instances and "realistic" prototypes is narrowed by contrastive learning. Finally, we measure image-prototype distance with a learnable metric. Prototypes are polished by adjacent high-quality web images and are used to remove distant out-of-distribution samples. In experiments, FoPro is trained on web datasets under the guidance of a few real-world examples and evaluated on real-world datasets. Our method achieves state-of-the-art performance on three fine-grained datasets and two large-scale datasets. Compared with existing WSL methods under the same few-shot settings, FoPro still excels in real-world generalization. Code is available at https://github.com/yuleiqin/fopro.
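The prototype initialization and out-of-distribution filtering steps can be sketched as follows. This is an illustrative simplification, not the FoPro implementation: it uses plain cosine similarity in place of the learnable metric described above, and the function names, the toy embeddings, and the `threshold` value are all assumptions.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def init_prototypes(few_shot: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """One prototype per class: the normalized mean of the normalized
    few-shot real-world embeddings of that class."""
    prototypes = {}
    for label, embeddings in few_shot.items():
        normed = [l2_normalize(e) for e in embeddings]
        dim = len(normed[0])
        mean = [sum(e[d] for e in normed) / len(normed) for d in range(dim)]
        prototypes[label] = l2_normalize(mean)
    return prototypes


def keep_web_sample(embedding: Sequence[float], label: str,
                    prototypes: Dict[str, List[float]],
                    threshold: float = 0.5) -> bool:
    """Keep a web image only if it is close enough to its class prototype.
    Cosine similarity stands in for the paper's learnable metric."""
    unit = l2_normalize(embedding)
    similarity = sum(a * b for a, b in zip(unit, prototypes[label]))
    return similarity >= threshold


if __name__ == "__main__":
    protos = init_prototypes({"cat": [[1.0, 0.0], [1.0, 0.0]],
                              "dog": [[0.0, 1.0], [0.0, 2.0]]})
    print(keep_web_sample([0.9, 0.1], "cat", protos))  # close to the "cat" prototype
    print(keep_web_sample([0.1, 0.9], "cat", protos))  # likely a mislabeled web image
```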
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
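Increasing the number of layers in on-chip photonic neural networks (PNNs) is essential for improving their model performance. However, successively cascading hidden layers results in a larger integrated photonic chip area. To address this issue, we propose the optical neural ordinary differential equation (ON-ODE) architecture, which parameterizes the continuous dynamics of the hidden layers with optical ODE solvers. The ON-ODE comprises PNNs followed by a photonic integrator and an optical feedback loop; it can be configured to represent residual neural networks (ResNet) and recurrent neural networks while effectively reducing chip area occupancy. For the interference-based optoelectronic nonlinear hidden layer, numerical experiments show that a single-hidden-layer ON-ODE performs approximately as well as a two-layer optical ResNet on image classification tasks. In addition, the ON-ODE improves the classification accuracy of the diffraction-based all-optical linear hidden layer. The time-dependent dynamics of the ON-ODE are further applied to high-accuracy trajectory prediction.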
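Clustering is an important exploratory data analysis technique for grouping objects based on their similarity. The widely used $K$-means clustering method relies on some notion of distance to partition data into a smaller number of groups. In Euclidean space, the centroid-based and distance-based formulations of $K$-means are equivalent. In modern machine learning applications, data often arise as probability distributions, and measure-valued data can be handled with the optimal transport metric. Because of the non-negative Alexandrov curvature of the Wasserstein space, barycenters suffer from regularity and non-robustness issues. The peculiar behavior of Wasserstein barycenters may cause the centroid-based formulation to fail to represent the data points within a cluster, whereas the distance-based $K$-means approach and its semidefinite program (SDP) relaxation can recover the true cluster labels. In the special case of clustering Gaussian distributions, we show that the SDP-relaxed Wasserstein $K$-means achieves exact recovery provided the clusters are well separated under the $2$-Wasserstein metric. Our simulation and real-data examples also demonstrate that distance-based $K$-means achieves better classification performance than standard centroid-based $K$-means when clustering probability distributions and images.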
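As a small illustration of the distance-based formulation, the sketch below evaluates the pairwise-distance $K$-means objective for one-dimensional Gaussians, using the closed form $W_2^2\big(\mathcal{N}(m_1, s_1^2), \mathcal{N}(m_2, s_2^2)\big) = (m_1 - m_2)^2 + (s_1 - s_2)^2$. It is not the SDP relaxation studied in the paper, and in this one-dimensional case the metric is Euclidean in (mean, standard deviation), so the example only shows the form of the objective, which needs distances but no barycenters. Function names and the toy data are assumptions.

```python
from typing import List, Sequence, Tuple

Gaussian1D = Tuple[float, float]  # (mean, standard deviation)


def w2_squared(p: Gaussian1D, q: Gaussian1D) -> float:
    """Squared 2-Wasserstein distance between two 1-D Gaussians."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def distance_based_objective(data: Sequence[Gaussian1D], labels: Sequence[int]) -> float:
    """Sum over clusters of (1 / (2 * |C|)) * sum_{i, j in C} d(i, j)^2.
    Only pairwise distances are used; no cluster centroid is computed."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    total = 0.0
    for members in clusters.values():
        pair_sum = sum(w2_squared(data[i], data[j]) for i in members for j in members)
        total += pair_sum / (2 * len(members))
    return total


if __name__ == "__main__":
    gaussians: List[Gaussian1D] = [(0.0, 1.0), (0.1, 1.0), (5.0, 2.0), (5.2, 2.0)]
    good = distance_based_objective(gaussians, [0, 0, 1, 1])
    bad = distance_based_objective(gaussians, [0, 1, 0, 1])
    print(good < bad)  # the well-separated grouping has the lower objective
```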
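Single image desnowing is a common yet challenging task. Complex snow degradations and diverse degradation scales demand strong representation ability. To let the desnowing network see various snow degradations and model the context interaction between local details and global information, we propose a powerful architecture called SnowFormer. First, it performs scale-aware feature aggregation in the encoder to capture rich snow information across various degradations. Second, to handle large-scale degradation, it uses a novel context interaction Transformer block in the decoder, which performs global context interaction between the local details and global information produced by the preceding scale-aware feature aggregation; introducing local context interaction further improves the recovery of scene details. Third, we design a heterogeneous feature projection head that progressively fuses features from the encoder and decoder and projects the refined features into the clean image. Extensive experiments show that the proposed SnowFormer achieves significant improvements over other SOTA methods. Compared with the SOTA single image desnowing method HDCW-Net, it improves the PSNR metric by 9.2dB on the CSD test set. Moreover, it achieves a 5.13dB increase in PSNR over the general image restoration architecture NAFNet, which verifies the strong representation ability of SnowFormer for the snow removal task. The code is released at https://github.com/ephemeral182/snowformer.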
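For readers who want to interpret the reported PSNR gains, here is a minimal sketch of the standard PSNR definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, together with the mean-squared-error ratio implied by a gain in dB. Images are represented as flat lists of pixel values for simplicity; the function names are illustrative and this is not the evaluation code used by the authors.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images,
    given as flat sequences of pixel values."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    print(psnr([10, 20, 30], [11, 19, 30]))
    print(mse_ratio_for_gain(9.2))   # roughly 8.3x lower MSE
    print(mse_ratio_for_gain(5.13))
```

Because PSNR is logarithmic, a 9.2dB gain corresponds to a mean squared error that is roughly 8.3 times smaller at the same peak value.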
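Self-supervised monocular methods can efficiently learn depth information for weakly textured surfaces or reflective objects. However, their depth accuracy is limited by the inherent ambiguity of monocular geometric modeling. In contrast, multi-frame depth estimation methods improve depth accuracy thanks to the success of multi-view stereo (MVS), which directly exploits geometric constraints. Unfortunately, MVS often suffers from texture-less regions, non-Lambertian surfaces, and moving objects, especially in real-world video sequences without known camera motion or depth supervision. We therefore propose MOVEDepth, which exploits monocular cues and velocity guidance to improve multi-frame depth learning. Unlike existing methods that enforce consistency between MVS depth and monocular depth, MOVEDepth improves multi-frame depth learning by directly addressing the inherent problems of MVS. The key to our approach is to use monocular depth as a geometric prior for constructing the MVS cost volume, and to adjust the depth candidates of the cost volume under the guidance of the predicted camera velocity. We further fuse monocular depth and MVS depth by learning the uncertainty in the cost volume, which yields depth estimates that are robust to ambiguity in multi-view geometry. Extensive experiments show that MOVEDepth achieves state-of-the-art performance: compared with Monodepth2 and PackNet, our method improves depth accuracy by a relative 20% and 19.8%, respectively, on the KITTI benchmark. MOVEDepth also generalizes to the more challenging DDAD benchmark, with a relative improvement of 7.2%. The code is available at https://github.com/jeffwang987/movedepth.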
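The LIDC-IDRI database is the most popular benchmark for lung cancer prediction. However, because it relies on subjective assessment by radiologists, nodules in LIDC may carry malignancy annotations that differ entirely from the pathological ground truth, introducing label assignment errors and subsequent supervision bias during training. The LIDC database therefore needs more objective labels for learning-based cancer prediction. Based on an extra small dataset containing 180 nodules diagnosed by pathological examination, we propose to re-label the LIDC data to mitigate the effect of the original annotation bias, verified on this robust benchmark. We demonstrate in this paper that providing new labels through similar-nodule retrieval based on metric learning is an effective re-labeling strategy. Training on these re-labeled LIDC nodules improves model performance, and the improvement grows when new labels for uncertain nodules are added. We further infer that re-labeling LIDC is currently an expedient route to robust lung cancer prediction, while building a large database of pathologically proven nodules provides the long-term solution.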
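Face clustering is a promising way to scale up face recognition systems using large-scale unlabeled face images. It remains challenging to identify small or sparse face image clusters, which we call hard clusters; the difficulty is caused by the heterogeneity of the clusters, i.e., high variation in size and sparsity. Consequently, the conventional practice of using a uniform threshold (to identify clusters) often leads to severe misclassification of samples that should belong to hard clusters. We tackle this problem by leveraging the neighborhood information of samples and inferring the cluster memberships (of samples) in a probabilistic way. We introduce two novel modules, Neighborhood-Diffusion-based Density (NDDe) and Transition-Probability-based Distance (TPDi), on top of which we can simply apply the standard density peak clustering algorithm with a uniform threshold. Our experiments on multiple benchmarks show that each module contributes to the final performance of our method, and that incorporating them into other advanced face clustering methods boosts the performance of those methods to a new state of the art. Code is available at: https://github.com/echoanran/on-mitigating-hard-clusters.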
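The density peak clustering step with a uniform threshold can be sketched as follows. This is a simplified illustration rather than the authors' method: local density is a plain neighbor count and distance is Euclidean, standing in for NDDe and TPDi respectively, and the function name and the `radius` and `link_threshold` parameters are assumptions.

```python
import math
from typing import List, Sequence


def density_peak_clustering(points: Sequence[Sequence[float]],
                            radius: float,
                            link_threshold: float) -> List[int]:
    """Link each point to its nearest higher-density point if that point lies
    within `link_threshold`; otherwise the point starts a new cluster."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density: number of other points within `radius` (stand-in for NDDe).
    density = [sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
               for i in range(n)]

    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-density[i], i))
    labels = [-1] * n
    next_label = 0
    for rank, i in enumerate(order):
        higher = order[:rank]  # points ranked as denser than point i
        parent = min(higher, key=lambda j: dist[i][j]) if higher else None
        if parent is not None and dist[i][parent] <= link_threshold:
            labels[i] = labels[parent]  # join the parent's cluster
        else:
            labels[i] = next_label      # density peak: start a new cluster
            next_label += 1
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    print(density_peak_clustering(pts, radius=1.5, link_threshold=2.0))
```

With a raw neighbor count, a small or sparse cluster receives low density values everywhere, which is the kind of situation where a single uniform threshold tends to misassign samples; the abstract's modules are aimed at making density and distance comparable across clusters so that one threshold suffices.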